Abstract

Spectral clustering (SC) and graph-based semi-supervised learning (SSL)
algorithms are sensitive to how graphs are constructed from data. In
particular, if the data has proximal and unbalanced clusters, these
algorithms can perform poorly on well-known graphs such as k-NN, full-RBF
and $\epsilon$-graphs. This is because objectives such as Ratio-Cut (RCut)
or normalized cut (NCut) attempt to trade off cut values against cluster
sizes, a tradeoff that is not tailored to unbalanced data. We propose a
novel graph partitioning framework, which parameterizes a family of graphs
by adaptively modulating node degrees in a k-NN graph. We then propose a
model selection scheme to choose sizable clusters that are separated by the
smallest cut values. Our framework is able to adapt to varying levels of
unbalancedness of data and can be naturally used for small cluster
detection. We theoretically justify our ideas through limit cut analysis.
Unsupervised and semi-supervised experiments on synthetic and real data
sets demonstrate the superiority of our method.



1 Introduction 

Data with unbalanced clusters arises in many learning applications and has
attracted much interest (He & Garcia, 2009). In this paper we focus on
graph-based spectral methods for clustering and semi-supervised learning
(SSL) tasks. While model-based approaches (Fraley & Raftery, 2002) may
incorporate unbalancedness, they typically assume simple cluster shapes and
need multiple restarts. In contrast, non-parametric graph-based approaches
do not have this issue and are able to capture complex shapes (Ng et al.,
2001). In spectral methods a graph representing the data is first
constructed. Then a graph-based learning algorithm such as spectral
clustering (SC) (Hagen & Kahng, 1992; Shi & Malik, 2000) or an SSL
algorithm (Zhu, 2008; Wang et al., 2008) is applied on the graph. Of the
two steps, graph construction has been identified as important (von
Luxburg, 2007; Maier et al., 2008a; Jebara et al., 2009), and we will see
that it is critical in the presence of unbalanced proximal clusters. Common
graph construction methods include the $\epsilon$-graph, the fully-connected
RBF-weighted (full-RBF) graph and the k-nearest neighbor (k-NN) graph. Of
the three, the k-NN graph appears to be the most popular due to its relative
robustness to outliers (Zhu, 2008; von Luxburg, 2007).



Drawbacks of spectral methods on unbalanced data have been documented:
Zelnik-Manor & Perona (2004) suggest an adaptive RBF parameter for the
full-RBF graph. More recently, Nadler & Galun (2006) describe these
drawbacks from a random walk perspective. Nevertheless, to the best of our
knowledge, there is no systematic way of adapting spectral methods to
possibly unbalanced data. There are other spectral methods (Bühler & Hein,
2009; Shi et al., 2009) that are claimed to handle unbalanced clusters
better than standard SC. However, they do not look into unbalanced data
specifically; meanwhile our framework can be combined with these methods.
Also related is size-constrained clustering (Simon & Teng, 1997; Höppner &
Klawonn, 2008; Zhu et al., 2010), which imposes constraints on the number
of points per cluster. This is a different problem because with size
constraints the partitions may not be low-density cuts, while our
clustering goal here is to find natural partitions separated by density
valleys: clusters could be unbalanced, but we do not know a priori how
unbalanced they are.

The poor performance of spectral methods in the presence of unbalanced
clusters is a result of minimizing the RatioCut (RCut) or normalized cut
(NCut) objective on these graphs, which seeks a tradeoff between minimum
cut-values and cluster sizes. While robust to outliers, this sometimes
leads to meaningless cuts. In Section 2 we illustrate some of the
fundamental issues underlying the poor performance of spectral methods on
unbalanced data. We then describe a novel graph-based learning framework in
Section 3. Specifically, we propose to parameterize a family of graphs by
adaptively modulating node degrees based on the ranking of all samples.
This rank-modulated degree (RMD) strategy asymptotically results in reduced
(increased) cut-values near density valleys (high-density areas). Based on
this parametric scheme we present a model selection step that finds the
lowest-density partition with sizable clusters. Our approach is able to
handle varying levels of unbalanced data and detect small clusters. We
explore the theoretical basis in Section 4. In Section 5 we present
experiments on synthetic and real data sets that show significant
improvements in SC and SSL results over conventional graphs. Proofs appear
in the supplementary section.

2 Problem Definition 

We describe our problem in an abstract continuous setting. Assume that data
is drawn from some unknown density $f(x)$, where $x \in \mathbb{R}^d$. For
simplicity we consider binary clustering problems, but our setup
generalizes to an arbitrary number of partitions. We seek a hypersurface
$S$ that partitions $\mathbb{R}^d$ into two non-empty subsets $D$ and
$\bar{D}$ (with $D \cup \bar{D} = \mathbb{R}^d$).

While there are many ways to formulate partitioning problems, we formulate
the goal of binary partitioning as finding a hypersurface that passes
through minimum-density regions, namely,

$$S_0 = \arg\min_{S} \int_{S} \psi(f(s))\, ds \tag{1}$$

where $\psi(\cdot)$ is some positive monotonic function. This goal is too
simplistic, since the resulting partitions could be empty. Consequently, we
need to constrain the measures, $\min\{\mu(D), \mu(\bar{D})\} \geq \delta$
for some $\delta > 0$, to ensure meaningful partitions, where
$\mu(A) = \mathrm{Prob}\{x \in A\}$. The optimal hypersurface $S_0$ need
not be balanced.

Definition 1. We say the data is $\alpha$-unbalanced if the hypersurface
$S_0$ results in partitions $(D_0, \bar{D}_0)$ with:

$$\min\{\mu(D_0), \mu(\bar{D}_0)\} = \alpha < 1/2.$$

We now focus on a finite-sample objective mirroring the continuous
objective of Eq. (1). Let $G = (V, E)$ be a graph constructed using $n$
samples in some manner consistent with the underlying topology of the
ambient space. We denote by $S$ a cut that partitions $V$ into $C_S$ and
$\bar{C}_S$. The cut-value associated with $S$ is:

$$\mathrm{Cut}(C_S, \bar{C}_S) = \sum_{u \in C_S,\ v \in \bar{C}_S,\ (u,v) \in E} w(u, v) \tag{2}$$

The empirical variant of Eq. (1) is to minimize the cut-value subject to
sizable cluster constraints:

$$S_* = \arg\min_{S} \left\{ \mathrm{Cut}(C_S, \bar{C}_S) \;\middle|\; \min\{|C_S|, |\bar{C}_S|\} \geq \delta |V| \right\} \tag{3}$$

We assume that the cut $S_*$ results in $(C_*, \bar{C}_*)$.
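As a concrete reading of Eqs. (2) and (3), the following minimal sketch computes the cut-value of a binary partition and checks the size constraint. The edge-dictionary representation and the names `cut_value`, `is_admissible` and `toy_weights` are illustrative assumptions, not part of the paper.

```python
def cut_value(weights, cluster):
    """Eq. (2): sum of w(u, v) over edges with exactly one endpoint in `cluster`.

    `weights` maps an undirected edge (u, v), stored once, to its weight.
    """
    cluster = set(cluster)
    return sum(w for (u, v), w in weights.items()
               if (u in cluster) != (v in cluster))


def is_admissible(cluster, n_nodes, delta):
    """Size constraint of Eq. (3): both sides hold at least delta * |V| nodes."""
    size = len(set(cluster))
    return min(size, n_nodes - size) >= delta * n_nodes


# Hypothetical toy graph on 5 nodes with one weak edge between {0, 1, 2} and {3, 4}.
toy_weights = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 0.5, (2, 3): 0.2, (3, 4): 1.0}
```

On this toy graph the partition {0, 1, 2} versus {3, 4} has the smallest cut-value among admissible partitions for delta = 0.3, while the singleton {0} is rejected by the size constraint.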

2.1 Graph Partitioning Algorithms 

Existing graph partitioning algorithms aim to minimize various objectives
on the graph. The min-cut approach (Stoer & Wagner, 1997) directly
minimizes the cut-value of Eq. (2). While simple and efficient, this method
can suffer from serious outlier problems without sizable cluster
constraints. The popular SC algorithms attempt to minimize RCut or NCut:

$$\mathrm{RCut}(C_S, \bar{C}_S) = \mathrm{Cut}(C_S, \bar{C}_S) \left( \frac{1}{|C_S|} + \frac{1}{|\bar{C}_S|} \right), \qquad \mathrm{NCut}(C_S, \bar{C}_S) = \mathrm{Cut}(C_S, \bar{C}_S) \left( \frac{1}{\mathrm{vol}(C_S)} + \frac{1}{\mathrm{vol}(\bar{C}_S)} \right) \tag{4}$$

where $\mathrm{vol}(C) = \sum_{u \in C,\ v \in V} w(u, v)$. Both NCut and
RCut seek to trade off low cut-values against cluster sizes. While robust
to outliers, minimizing RCut (NCut) on traditional graphs can fail when the
data is unbalanced (i.e., with small $\alpha$ in Def. 1). To further
motivate this issue, we first define two quantities, the cut-ratio $q$ and
the unbalancedness coefficient $y$, associated with the optimal cut
resulting from Eq. (3) and any balanced cut $S_B$:

$$q = \frac{\mathrm{Cut}(C_*, \bar{C}_*)}{\mathrm{Cut}(C_B, \bar{C}_B)}, \qquad y = \frac{\min\{\mathrm{size}(C_*), \mathrm{size}(\bar{C}_*)\}}{\mathrm{size}(C_*) + \mathrm{size}(\bar{C}_*)}$$

where $\mathrm{size}(C) = |C|$ for RCut and $\mathrm{size}(C) =
\mathrm{vol}(C)$ for NCut. The cut $S_B$ is associated with
$(C_B, \bar{C}_B)$ and is balanced, i.e., $\mathrm{size}(C_B) =
\mathrm{size}(\bar{C}_B)$. Note that $y \in [0, 0.5]$ is an empirical
measure of the unbalancedness of the optimal cut $S_*$, while
$q \in [0, 1]$ is the ratio of the cut-value at a "density valley" to that
of a balanced cut. Next we characterize the necessary condition for
minimizing RCut (NCut) to work correctly.

Proposition 1. SC fails, i.e., the RCut/NCut values of balanced cuts $S_B$
are smaller than that of $S_*$ (obtained in Eq. (3)), whenever
$q > 4y(1 - y)$.

The proof follows by direct substitution. Prop. 1 suggests (see Fig. 1)
that if the unbalancedness $y$ is sufficiently small, say 0.15, then
$4y(1 - y) \approx 0.51$, so the "density valley" has to be about twice as
deep as the balanced cut, in terms of cut value, for RCut/NCut to be
effective.
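The threshold in Prop. 1 can be checked numerically. The sketch below is only an illustration of the substitution argument under the RCut definition of Eq. (4); the function names and the default values of `n` and `balanced_cut` are assumptions made for the example.

```python
def rcut(cut, size_a, size_b):
    """RCut of Eq. (4) for a binary partition with the given cluster sizes."""
    return cut * (1.0 / size_a + 1.0 / size_b)


def failure_threshold(y):
    """Prop. 1: the balanced cut wins whenever the cut-ratio q exceeds 4y(1 - y)."""
    return 4.0 * y * (1.0 - y)


def balanced_cut_wins(q, y, n=1000, balanced_cut=1.0):
    """Compare RCut of the valley cut (cut value q * balanced_cut, smaller side
    y * n) with RCut of a balanced cut (both sides n / 2)."""
    valley = rcut(q * balanced_cut, y * n, (1.0 - y) * n)
    balanced = rcut(balanced_cut, 0.5 * n, 0.5 * n)
    return balanced < valley
```

For y = 0.15 the threshold is about 0.51: with q = 0.6 the balanced cut has the smaller RCut (SC fails), while with q = 0.4 the valley cut is preferred.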
Figure 1: Cut-ratio (q) vs. unbalancedness (y, the smaller cluster
proportion). The RCut value is smaller for balanced cuts than for
unbalanced low-density cuts whenever the cut-ratio is above the curve.



Examining the limit behavior of the k-NN, $\epsilon$- and full-RBF graphs
(Maier et al., 2008a; Narayanan et al., 2006) is instructive for
understanding the pair $(q, y)$. For properly chosen $k_n$, $\sigma_n$ and
$\epsilon_n$ respectively, as the number of samples $n \to \infty$, $q$ and
$y$ converge with high probability to:

$$q \to \frac{\int_{S_0} f^{\gamma}(s)\, ds}{\int_{S_B} f^{\gamma}(s)\, ds}, \qquad y \to \min\{\mu(D_0), \mu(\bar{D}_0)\} \tag{5}$$

where $\gamma$ is a graph-dependent constant.¹ $S_0$ is the solution to
Eq. (1), resulting in $(D_0, \bar{D}_0)$. Essentially, each graph
construction corresponds to a point $(q, y)$ in Fig. 1. The issue with
traditional graphs is that, with $q$ unchanged, the data could be so
unbalanced that the point $(q, y)$ falls above the curve in Fig. 1, leading
RCut (NCut) to pick the balanced cut!

2.2 Our Algorithm 

In order to adapt RCut (NCut) based algorithms to unbalanced data, our main
point is that the "balancing term" is difficult to manipulate since it
corresponds to the number of samples. On the other hand, the cut-ratio $q$
can be controlled by adaptively parameterizing different graphs on the same
node set.

To motivate this idea, consider binary partitioning on a weighted graph
$G_0 = (V, E_0, W_0)$. The node set $V$ is associated with the samples,
while $E_0$, $W_0$ are obtained through k-NN, full-RBF or $\epsilon$-graph
constructions, or a combination thereof. The important point for future
reference is that the graph $G_0 = (V, E_0, W_0)$ is fixed. We seek a cut
$(C, \bar{C})$ for this graph that satisfies Eq. (3). Conventional methods
rely on graph partitioning techniques such as RCut/NCut based SC to obtain
partitions. We have argued that this can skew the result towards balanced
cuts that are not representative of the actual clusters.



To account for unbalancedness we adopt a new strategy here. The idea is to
parameterize a family of graphs over a parametric space,
$\lambda \in \Lambda$, with different edge sets on the same node set:

$$G(\lambda) = (V, E(\lambda), W(\lambda)), \quad \lambda \in \Lambda$$

We will see that the mapping $E(\lambda), W(\lambda)$ allows for
asymmetrical emphasis between low and high density for different choices of
parameters. A number of graph partitioning techniques such as RCut/NCut
based SC can now be applied on graphs with different choices of
$\lambda \in \Lambda$ to obtain different partitions. We thus obtain a
mapping from $\lambda \in \Lambda$ to a partition
$(C(\lambda), \bar{C}(\lambda))$:

$$\lambda \mapsto (C(\lambda), \bar{C}(\lambda))$$

Then we can evaluate the cut value of these partitions on the reference
graph $G_0$, namely,

$$\mathrm{Cut}_0(C(\lambda), \bar{C}(\lambda)) = \sum_{u \in C(\lambda),\ v \in \bar{C}(\lambda)} w_0(u, v)\, \mathbb{1}_{\{(u,v) \in E_0\}}$$

We can then pick the $\lambda$ (the partition) that minimizes the cut value
under the constraint that each cluster has at least a $\delta$ fraction of
the samples,

$$\lambda_* = \arg\min_{\lambda \in \Lambda} \left\{ \mathrm{Cut}_0(C(\lambda), \bar{C}(\lambda)) \;\middle|\; \min\{|C(\lambda)|, |\bar{C}(\lambda)|\} \geq \delta |V| \right\} \tag{6}$$

and output $(C(\lambda_*), \bar{C}(\lambda_*))$ as the optimal partition.
Notice that our framework aims exactly at the optimal criterion of Eq. (3).

¹ For k-NN graphs $\gamma < 1$, and $\gamma \in [1, 2]$ for $\epsilon$- and
full-RBF graphs.
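The selection rule of Eq. (6) can be sketched as follows, assuming the candidate partitions have already been computed by some partitioning routine (not shown). The names `select_partition`, `candidates` and `ref_weights`, and the edge-dictionary representation of the reference graph, are hypothetical choices for this illustration.

```python
def cut_value(weights, cluster):
    """Cut value on the reference graph: edges with one endpoint in `cluster`."""
    cluster = set(cluster)
    return sum(w for (u, v), w in weights.items()
               if (u in cluster) != (v in cluster))


def select_partition(candidates, ref_weights, n_nodes, delta):
    """Eq. (6): among candidate partitions {lambda: cluster}, return the pair
    (lambda, cut value) with the smallest cut on the reference graph, keeping
    only partitions whose smaller side has at least delta * n_nodes nodes.
    Returns None if no candidate is admissible."""
    best = None
    for lam, cluster in candidates.items():
        size = len(set(cluster))
        if min(size, n_nodes - size) < delta * n_nodes:
            continue  # discard partitions with a too-small cluster
        value = cut_value(ref_weights, cluster)
        if best is None or value < best[1]:
            best = (lam, value)
    return best


# Hypothetical reference graph: a 6-node path with one weak edge (3, 4).
ref_weights = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 4): 0.1, (4, 5): 1.0}
# Hypothetical partitions produced on three graphs of the family.
candidates = {1.0: {0, 1, 2}, 0.4: {0, 1, 2, 3}, 0.2: {0, 1, 2, 3, 4}}
```

With delta = 0.3 the unbalanced partition found at lambda = 0.4 is selected because it crosses only the weak edge; the lambda = 0.2 candidate is discarded since it leaves a single node on one side.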

This motivates how to parameterize a family of graphs to obtain rich binary
partitioning structures:

(1) Adaptively modulate the degree $k = k(x)$ node-wise based on the k-NN
graph,

(2) adaptively modulate the neighborhood size $\epsilon = \epsilon(x)$
based on the $\epsilon$-graph.

Both strategies are roughly equivalent. We adopt the first scheme since it
is easier to explicitly control the number of edges to ensure a connected
graph. Specifically, we propose to modulate the node degrees of a k-NN
graph parametrically, based on rankings of all samples. The rank indicates
whether a node lies near low or high density areas; therefore degree
modulation can lead to fewer/more edges in low/high density regions.
Consequently, for the same $y$ with the node set fixed, the cut-ratio $q$
is directly reduced, pulling down the point $(q, y)$ in Fig. 1, so that
RCut (NCut) based algorithms seek density valley cuts.

We propose a novel graph partitioning framework involving the following
steps:

(a) Parameterize a family of graphs with different edge sets on the same
node set;

(b) Minimize RCut (NCut) on this family of graphs to get a family of
partitions;

(c) Select the best partition that solves Eq. (3).
Figure 2: Mixture of two Gaussians with mixture proportions 0.85 and 0.15. The corresponding mean vectors are
[4.5; 0] and [0; 0], and the covariances are diag(2, 1) and diag(1, 1). Cut and RCut values for the k-NN, $\epsilon$-, full-RBF
and our (RMD) graphs are plotted. The curves in (b),(c) are averaged over 20 Monte Carlo runs. The values are re-scaled
for demonstration. $\sigma$ is the RBF parameter, $d_k$ is the average k-NN distance. The number of samples is n = 1000, and
k = 30. For (f) unweighted RMD graph with $l = 30$, $\lambda = 0.4$; for (d) unweighted k-NN; for (e) $\epsilon = \sigma = d_k$. Example
(b) shows that large $\sigma$ in the k-NN graph results in smoothing of cut-values and the minimum RCut is not at the density
valley. (c),(e) show that smaller $\epsilon$, $\sigma$ have pronounced sensitivity to outliers (the RCut curve goes down near boundaries),
while large $\epsilon$, $\sigma$ smoothen the RCut value.



Remark: Note that one could also parameterize a family of k-NN, full-RBF or
$\epsilon$-graphs with $k$, $\epsilon$, $\sigma$ on the same node set. Such
a parameterization is not node-wise adaptive, which is critical for our
problem. We present an example to demonstrate this point. Fig. 2(a) shows
an unbalanced proximal density, with a "shallow" valley in the cut-value
curves (red) in (b),(c). RCut curves on "parameterized" traditional graphs
and our RMD graph are plotted in (b),(c). Note that large values of $k$,
$\epsilon$ and $\sigma$ tend to smooth the curve (sometimes even lose the
valley) and increase $q$, which worsens the problem. In contrast, reducing
$k$, $\epsilon$ and $\sigma$ below well-understood thresholds leads to
zigzag curves, disconnected graphs and sensitivity to outliers. Basically,
increasing/reducing $k$, $\epsilon$ or $\sigma$ results in uniformly
larger/smaller cut-values for all nodes, leading to poor control of $q$. On
the contrary, our rank-modulated degree (RMD) scheme results in fewer/more
edges for nodes in low/high density areas, directly reducing the cut-ratio
$q$. RCut minima on the RMD graph (black) tend to be near valleys, as seen
in Fig. 2(b),(c). In addition, the RMD graph inherits from k-NN the
advantage of being robust to outliers, as the RCut curve increases near
boundaries.



3 RMD Graphs: Main Steps 

Our RMD graph-based learning framework has the following steps:

(1) Rank Computation: The rank $R(x)$ of every point $x$ is calculated:

$$R(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{G(x) \leq G(x_i)\} \tag{7}$$

where $\mathbb{I}$ denotes the indicator function. Ideally we would like to
choose $G(\cdot)$ to be the underlying density $f(\cdot)$ of the data.
Since $f$ is unknown, we need to employ some surrogate statistic. While
many choices are possible, the statistic in this paper is based on
nearest-neighbor distances. Such statistics have been employed for
high-dimensional anomaly detection (Zhao & Saligrama, 2009; Saligrama &
Zhao, 2012). Details are described in Sec. 3.1. The rank is a normalized
ordering of all points based on $G$, ranges in [0, 1], and indicates how
extreme $x$ is among all points.

(2) Parameterized RMD Graph Construction:

Build RMD graphs by connecting each point $x$ to its $\deg(x)$ closest
neighbors. The degree $\deg(x)$ of node $x$ is modulated as follows:

$$\deg(x) = k\,(\lambda + 2(1 - \lambda) R(x)), \tag{8}$$

where $\lambda \in (0, 1]$ parameterizes the family of RMD graphs and $k$
is the average degree. We discretize $\lambda$ in {0.2, 0.4, 0.6, 0.8, 1}
in experiments. It is not difficult to see that $R(x)$ converges (in
distribution) to a uniform measure on the unit interval regardless of the
underlying density $f(\cdot)$. Since a uniform rank has mean 1/2, the
average degree across all samples is $k(\lambda + (1 - \lambda)) = k$. The
minimum degree $\lambda k$ can be used to ensure a connected graph when
necessary. Note that we also vary $k$, $\sigma$ in our experiments for a
more thorough demonstration.
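The degree rule of Eq. (8) is simple enough to state in code. In the sketch below the rounding to an integer degree (with a floor of 1) is an assumption made for illustration; the text does not specify how non-integer degrees are handled, and the function names are hypothetical.

```python
def rmd_degree_real(rank, k, lam):
    """Real-valued degree of Eq. (8): k * (lam + 2 * (1 - lam) * rank)."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    if not 0.0 <= rank <= 1.0:
        raise ValueError("rank must lie in [0, 1]")
    return k * (lam + 2.0 * (1.0 - lam) * rank)


def rmd_degree(rank, k, lam):
    """Integer degree; rounding and the floor of 1 are assumptions of this sketch."""
    return max(1, round(rmd_degree_real(rank, k, lam)))
```

With k = 30 and lam = 0.4, a point of rank 0 (lowest density) gets 12 neighbors, a point of rank 1 gets 48, and the average over uniformly spread ranks stays at k = 30. Setting lam = 1 recovers the plain k-NN degree for every point.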

(3) Graph-based Learning: Apply graph-based clustering or SSL algorithms on
the family of RMD graphs to obtain a family of partitions. RCut/NCut based
SC algorithms are well established. We use both objectives in Sec. 5 but
mainly focus on NCut since it has better performance and is recommended.
For SSL tasks we employ RCut-based Gaussian Random Fields (GRF) and
NCut-based Graph Transduction via Alternating Minimization (GTAM). These
approaches all involve minimizing $\mathrm{Tr}(F^T L F)$ plus some
constraints or penalties, where $L$ is the graph Laplacian and $F$ the
cluster indicator or classification function. This is related to RCut
(NCut) minimization (Chung, 1996). We refer readers to the references (Zhu,
2008; Wang et al., 2008; von Luxburg, 2007) for details.



(4) Min-Cut Model Selection: The final step is to select the min-cut
partition that is meaningful according to Eq. (3). Our main assumption is
that we have prior knowledge that the smallest cluster is at least of size
$\delta n$. The K-partitions obtained from step (3) are now parameterized:
$(C_1(\lambda, k, \sigma), \ldots, C_K(\lambda, k, \sigma))$. We pick the
partition with the minimum Cut value (lowest density valley) over all
admissible choices:

$$\min_{\lambda, k, \sigma} \ \mathrm{Cut}_0(C_1(\lambda, k, \sigma), \ldots, C_K(\lambda, k, \sigma)) \quad \text{s.t.} \ \min\{|C_1(\lambda, k, \sigma)|, \ldots, |C_K(\lambda, k, \sigma)|\} \geq \delta n \tag{9}$$



Algorithm 1: RMD Graph-based Learning
Input: n data samples $\{x_1, \ldots, x_n\}$ (partially
labeled for SSL), number of clusters/classes K,
smallest cluster/class size threshold $\delta$.
Steps:

1. Compute ranks of samples based on Eq. (7).

2. For different $\lambda$, $k$, $\sigma$, do:

a. Construct the RMD graph based on Eq. (8);

b. Apply graph-based learning algorithms on the
current RMD graph to get K clusters.

3. Compute Cut values of the different partitions from
step 2 on the $k_0$-NN graph. Pick the partition with
the smallest Cut value based on Eq. (9).
Output: the selected K-partition.

Remark: Our framework improves the graph construction step and augments it
with a model selection step based on the desired optimality criterion, but
does not change the graph-based learning algorithms. This implies that our
framework can be combined with other graph partitioning algorithms to
improve performance on unbalanced data, such as the ratio/normalized
Cheeger cut (Bühler & Hein, 2009).



3.1 Rank Computation 

We now specify the statistic $G$ in rank computation. We choose the
statistic $G$ in Eq. (7) based on nearest-neighbor distances:

$$G(x) = \frac{1}{l} \sum_{i=l+1}^{2l} D_{(i)}(x) \tag{10}$$

where $D_{(i)}(x)$ denotes the distance from $x$ to its $i$-th nearest
neighbor, so $G$ is the average of $x$'s $(l+1)$-th to $2l$-th nearest
neighbor distances. Other choices for $G$ are possible: (1) $G(x)$ is the
number of neighbors within an $\epsilon$-ball of $x$, or (2) $G(x)$ is the
distance from $x$ to its $l$-th nearest neighbor. Empirically (and
theoretically) we have observed that Eq. (10) leads to better performance
and robustness. The ranks are relative orderings of points and are quite
insensitive to the choice of the neighborhood size parameter $l$. To
further reduce variance in rank computation we also employ a U-statistic
technique (Koroliuk & Borovskich, 1994) with $B$ rounds of resampling.
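A brute-force sketch of the rank computation may help fix ideas. It assumes the statistic of Eq. (10) and a rank of the form R(x) = (1/n) Σ_i I{G(x) ≤ G(x_i)}, so that points in dense regions (small neighbor distances) receive high ranks; the U-statistic resampling is omitted, and the function names are hypothetical.

```python
import math


def knn_statistic(points, l):
    """Eq. (10): for each point, the average distance to its (l+1)-th through
    2l-th nearest neighbors. Brute force, O(n^2 log n); needs n - 1 >= 2l."""
    stats = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        stats.append(sum(dists[l:2 * l]) / l)  # indices l .. 2l-1 are the (l+1)-th .. 2l-th NN
    return stats


def ranks(stats):
    """Fraction of points whose statistic is at least as large as G(x)."""
    n = len(stats)
    return [sum(1 for g_i in stats if g <= g_i) / n for g in stats]


# Hypothetical 1-D example: a dense group of five points and one isolated point.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (50.0,)]
example_ranks = ranks(knn_statistic(points, l=2))
```

In this example the central point of the dense group gets rank 1 and the isolated point gets the lowest rank, 1/6, matching the intended reading that high ranks indicate high density.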



In the model selection step of Eq. (9), partitions with clusters smaller
than $\delta n$ are discarded. $\mathrm{Cut}_0(\cdot)$ indicates that the
Cut values of the different partitions are evaluated on the same reference
$k_0$-NN graph in order to pick the min-cut partition. This step aims
exactly at the optimal criterion of Eq. (3). Note that whether RCut or NCut
is used, the above size constraint only considers the number of points
within the clusters.



4 Analysis 

Our asymptotic analysis shows how graph sparsification leads to control of
the cut-ratio $q$ introduced in Sec. 2. Detailed proofs can be found in the
supplementary section. Assume the data set $\{x_1, \ldots, x_n\}$ is drawn
i.i.d. from a density $f$ in $\mathbb{R}^d$, where $f$ has compact support
$C$. Let $G = (V, E)$ be the RMD graph. Given a separating hyperplane $S$,
denote by $C^+$, $C^-$ the two subsets of $C$ split by $S$, and by
$\eta_d$ the volume of the unit ball in $\mathbb{R}^d$.

First we show the asymptotic consistency of the rank $R(y)$ at some point
$y$. The limit of $R(y)$, $p(y)$, is the complement of the volume of the
level set containing $y$. Note that $p$ exactly follows the shape of $f$,
and always ranges in [0, 1] no matter how $f$ scales.

Theorem 2. If $f(x)$ satisfies some regularity conditions, then as
$n \to \infty$, we have

$$R(y) \to p(y) := \int_{\{x:\, f(x) \leq f(y)\}} f(x)\, dx. \tag{11}$$

Remark:

(1) The value of $R(x)$ is a direct indicator of whether $x$ lies in a
high or low density region (Fig. 3).

(2) Asymptotically, $R(x)$ is an integral of the pdf. It is smooth and
uniformly distributed in [0, 1]. This makes it appropriate for modulating
the degrees with control of the minimum, maximum and average degree.





Figure 3: Density level sets & rank estimates for unbalanced and proximal
Gaussian mixtures. High/low ranks correspond to high/low density levels.



Next we study RCut (NCut) induced on the unweighted RMD graph. The limit
cut expression on the RMD graph involves an additional adjustable term
which varies point-wise according to the density. For technical simplicity,
we assume the RMD graph ideally connects each point $x$ to its $\deg(x)$
closest neighbors.

Theorem 3. Suppose some smoothness assumptions hold and let $S$ be a fixed
hyperplane in $\mathbb{R}^d$. For the unweighted RMD graph, set the degrees
of points according to Eq. (8), where $\lambda \in (0, 1)$ is a constant.
Let $\rho(x) = \lambda + 2(1 - \lambda) p(x)$. Assume $k_n / n \to 0$. In
case $d = 1$, assume $k_n / \sqrt{n} \to \infty$; in case $d \geq 2$ assume
$k_n / \log n \to \infty$. Then as $n \to \infty$ we have that:

$$\frac{1}{k_n} \sqrt[d]{\frac{n}{k_n}}\, \mathrm{RCut}_n(S) \to C_d B_S \int_S f^{1 - \frac{1}{d}}(s)\, \rho(s)^{1 + \frac{1}{d}}\, ds, \tag{12}$$

$$\sqrt[d]{\frac{n}{k_n}}\, \mathrm{NCut}_n(S) \to C_d B_S \int_S f^{1 - \frac{1}{d}}(s)\, \rho(s)^{1 + \frac{1}{d}}\, ds, \tag{13}$$

where $C_d$ is a constant depending only on the dimension $d$,
$B_S = \mu(C^+)^{-1} + \mu(C^-)^{-1}$, and
$\mu(C^{\pm}) = \int_{C^{\pm}} f(x)\, dx$.

Remark:

(1) Compared to the limit expression on the k-NN graph, there is an
additional term $\rho(s) = \lambda + 2(1 - \lambda) p(s)$. The monotonicity
of $\rho(s)$ in $p(s)$ immediately implies that the "infinitesimal" cut
contribution at low (high) density areas is reduced (increased). To see the
impact suppose $\lambda$ is small; for cuts $S$ near modes,
$p(s) \approx 1$ and this extra term is nearly $2^{1 + \frac{1}{d}}$. For
$S$ near valleys, $p(s) \approx 0$ and this term is nearly
$\lambda^{1 + \frac{1}{d}} < 1$. The cut-ratio $q$ is explicitly reduced.

(2) Smaller $\lambda$ further penalizes high density areas over low density
areas, further reduces the cut-ratio $q$ and pulls down $(q, y)$ in Fig. 1,
and thus has the ability to cope with even more unbalanced data (with
smaller $y$). Therefore, without a priori information about how unbalanced
the data is, parameterizing graphs with varying values of $\lambda$ gives
RCut (NCut) based algorithms the ability to adapt to data with varying
levels of unbalancedness.

5 Simulations



Experiments in this section involve both synthetic and real data sets. We
focus on unbalanced data by randomly sampling from different classes in an
unbalanced manner. Among traditional graphs we also include the b-matching
graph (Jebara et al., 2009) with $b = k$.

For clustering experiments we apply both RCut and NCut based SC, but focus
on NCut since it is generally known to perform better. We report
performance by evaluating how well the cluster structures match the ground
truth labels, as is the standard criterion for partitional clustering (Xu,
2005). For instance, consider Tab. 1, where error rates for USPS symbols
1,8,3,9 are tabulated. We parameterize various graphs and apply SC to get
various partitions. Our model selection scheme picks the partition
according to Eq. (9), AGNOSTIC to the correspondence between samples and
symbols. Errors are then reported by looking at mis-associations.

For SSL experiments we randomly pick labeled points among the unbalancedly
sampled data, guaranteeing at least one labeled point from each class. SSL
algorithms such as RCut-based GRF and NCut-based GTAM are applied on
parameterized graphs built from partially labeled data, and generate
various partitions. The model selection scheme picks the min-cut partition
based purely on graph structure according to Eq. (9). Then labels for the
unlabeled data are predicted based on the selected partition and compared
against the UNKNOWN true labels to produce the error rates.

Some general simulation parameters are:

(1) We employ the U-statistic technique in rank computation to reduce
variance (Sec. 3.1), with $B = 5$.

(2) All error rate results are averaged over 20 trials.
Other parameters will be specified below.

5.1 Synthetic DataSets 

Consider a multi-cluster complex-shaped data set, which is composed of 1
small Gaussian and 2 moon-shaped proximal clusters, shown in Fig. 4. The
sample size is n = 1000, with the rightmost small cluster containing 10% of
the points and the two moons 45% each. In this example, for illustration we
did not parameterize the graph or apply the model selection step. We fix
$\lambda = 0.5$, and choose $k = l = 30$, $\epsilon = \sigma = d_k$, where
$d_k$ is the average k-NN distance. Model-based approaches can fail on such
a data set due to the complex shapes of the clusters. The 3-partition SC
based on RCut is applied. On k-NN and b-matching graphs SC fails for two
reasons: (1) SC cuts at balanced positions and cannot detect the rightmost
small cluster; (2) SC cannot recognize the long winding low-density regions
between the 2 moons because there are too many spurious edges and the Cut
value along the curve is large. SC fails on the $\epsilon$-graph (and
similarly on full-RBF) because the outlier point forms a singleton cluster,
and it also cannot recognize the low-density curve. The RMD graph
significantly sparsifies the graph in low-density regions, enabling SC to
cut along the valley and detect the small cluster while remaining robust to
outliers.

5.2 Real DataSets 

Figure 5: Error rate performance of (a) SC and (b) GTAM on USPS 8vs9 with
varying levels of unbalancedness, for RBF k-NN, RBF b-matching, full-RBF
and RBF RMD (our method) graphs. We omitted GRF since the results are
qualitatively similar. Our method adapts to different levels of
unbalancedness much better than traditional graphs. Furthermore, when the
data is very unbalanced (large $n_9/n_8$), varying $k$, $\sigma$ does not
really help; decreasing $\lambda$ adapts the algorithm well.



We focus on unbalanced settings and consider several real data sets. We
construct k-NN, b-matching, full-RBF and RMD graphs, all combined with RBF
weights, but do not include the $\epsilon$-graph because of its overall
poor performance (Jebara et al., 2009). We discretize not only $\lambda$
but also $k$, $\sigma$ to parameterize graphs. The sample size is around
750 to 1500, as described for each experiment. We vary $k$ in
{10, 20, 30, ..., 100}. Note that although small $k$ in our scheme may lead
to disconnected graphs due to the minimum degree $\lambda k$ in Eq. (8),
the resulting partitions with singleton clusters will be ruled out by the
constraints of Eq. (9). Also notice that for $\lambda = 1$, the RMD graph
is identical to the k-NN graph. For the RBF parameter $\sigma$, it has been
suggested that it be of the same scale as the average k-NN distance $d_k$
(Wang et al., 2008). This suggests a discretization of $\sigma$ as
$2^j d_k$ with $j = -3, -2, \ldots, 3$. We discretize
$\lambda \in \{0.2, 0.4, 0.6, 0.8, 1\}$. In the model selection step of
Eq. (9), cut values of the various partitions are evaluated on the same
$k_0$-NN graph with $k_0 = 30$, $\sigma = d_{30}$ before selecting the
min-cut partition. $l$ is fixed to be 30. The true number of
clusters/classes K is assumed to be known. We assume meaningful clusters
are at least 5% of the total number of points, $\delta = 0.05$. We set the
GTAM parameter $\mu = 0.05$ (Jebara et al., 2009) for the SSL tasks, and
each time 20 randomly labeled samples are chosen with at least one sample
from each class.
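The discretization described for the real-data experiments can be enumerated as a simple parameter grid. In this sketch `avg_knn_distance` is a hypothetical callable returning the average k-NN distance $d_k$ for a given k (it would be computed from the data); only the grid values themselves come from the text.

```python
from itertools import product


def parameter_grid(avg_knn_distance):
    """Enumerate (lambda, k, sigma) with lambda in {0.2, ..., 1}, k in
    {10, ..., 100} and sigma = 2**j * d_k for j = -3, ..., 3.

    `avg_knn_distance` is an assumed callable k -> d_k.
    """
    lambdas = [0.2, 0.4, 0.6, 0.8, 1.0]
    ks = range(10, 101, 10)
    exponents = range(-3, 4)
    return [(lam, k, 2.0 ** j * avg_knn_distance(k))
            for lam, k, j in product(lambdas, ks, exponents)]


# Placeholder d_k = 1 for every k, just to show the size of the grid.
grid = parameter_grid(lambda k: 1.0)
```

With these settings the grid has 5 x 10 x 7 = 350 configurations, each of which yields one candidate partition for the model selection step.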

Varying Unbalancedness: We start with a comparison for 8vs9 of the 256-dim
USPS digit data set. We keep the total sample size at 750, and vary the
unbalancedness, i.e., the proportion of the numbers of points randomly
sampled from the two classes, denoted by $n_8$, $n_9$. Normalized SC and
GTAM are applied. Fig. 5 shows that when the underlying clusters/classes
are balanced, our method works as well as traditional graphs; as the
unbalancedness increases, the performance degrades severely on traditional
graphs, while our method adapts the graph-based learning algorithms to
different levels of unbalancedness very well.

Other Real Data Sets: We apply SC and SSL algorithms on several other real
data sets including USPS, waveform database generator (21-dim), Statlog
landsat satellite images (36-dim), letter recognition images (16-dim) and
optical recognition of handwritten digits (64-dim) (Frank & Asuncion,
2010). We randomly sample 150/600, 200/400/600, 200/300/400/500 points for
the 2-, 3- and 4-class cases, with the corresponding orders of class
indices listed in Tabs. 1 and 2. For comparison we also include the full
graph with adaptive RBF weights (full-aRBF), where $\sigma_u$ is chosen as
the k-NN distance of node $u$, and
$w(u, v) = \exp(-d(u, v)^2 / (2 \sigma_u \sigma_v))$ (Zelnik-Manor &
Perona, 2004). Tabs. 1 and 2 show that varying $k$, $\sigma$ for
traditional graphs does not work well, while our method consistently
performs better.
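For reference, the adaptive RBF weight used for the full-aRBF baseline can be written as a one-line function. This is a sketch of the formula as stated in the text; the function name and argument names are illustrative.

```python
import math


def adaptive_rbf_weight(dist_uv, sigma_u, sigma_v):
    """w(u, v) = exp(-d(u, v)^2 / (2 * sigma_u * sigma_v)), where sigma_u and
    sigma_v are the local scales (k-NN distances) of the two endpoints."""
    return math.exp(-dist_uv ** 2 / (2.0 * sigma_u * sigma_v))
```

The weight is symmetric in u and v, and a pair of points in a sparse region (large local scales) receives a larger weight than an equally distant pair in a dense region.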
(a) k-NN (b) b-matching (c) $\epsilon$-graph (full-RBF) (d) RMD

Figure 4: Clustering results of 3-partition SC on the 2 moons and 1 Gaussian data set. SC on full-RBF ($\epsilon$-graph) completely
fails due to the outlier. For k-NN and b-matching graphs SC cannot recognize the long winding low-density regions
between the 2 moons, and fails to find the rightmost small cluster. Our method sparsifies the graph in low-density regions,
allowing SC to cut along the valley and detect the small cluster while remaining robust to outliers.



Table 1: Error rate performance of normalized SC on various graphs for unbalanced real data sets. Our method performs
significantly better than other methods.

Error Rates (%) |      USPS       |        SatImg         |        OptDigit         |  LetterRec
                | 8vs9  | 1,8,3,9 | 4vs3  | 3,4,5 | 1,4,7 | 9vs8  | 6vs8  | 1,4,8,9 | 6vs7 | 6,7,8
RBF k-NN        | 16.67 | 13.21   | 12.80 | 18.94 | 25.33 |  9.67 | 10.76 | 26.76   | 4.89 | 37.72
RBF b-matching  | 17.33 | 12.75   | 12.73 | 18.86 | 25.67 | 10.11 | 11.44 | 28.53   | 5.13 | 38.33
full-RBF        | 19.87 | 16.56   | 18.59 | 21.33 | 34.69 | 11.61 | 15.47 | 36.22   | 7.45 | 35.98
full-aRBF       | 18.35 | 16.26   | 16.79 | 20.15 | 35.91 | 10.88 | 13.27 | 33.86   | 7.58 | 35.27
RBF RMD         |  4.80 |  9.18   |  7.87 | 15.26 | 19.72 |  5.43 |  6.67 | 21.35   | 2.92 | 28.68



5.3 Applications to Small Cluster Detection 

We illustrate how our method can be used to find small-size clusters. This
type of problem may arise in community detection in large real networks,
where graph-based approaches are popular but small-size community detection
is difficult (Shah & Zaman, 2010).

The data set depicted in Fig. 6 has 1 large and 2 small proximal Gaussian
components along the $x_1$ axis:
$\sum_{i=1}^{3} \alpha_i \mathcal{N}(\mu_i, \Sigma_i)$, where
$\alpha_1 : \alpha_2 : \alpha_3 = 2 : 8 : 1$, $\mu_1 = [-0.7; 0]$,
$\mu_2 = [4.5; 0]$, $\mu_3 = [9.7; 0]$, $\Sigma_1 = I$,
$\Sigma_2 = \mathrm{diag}(2, 1)$, $\Sigma_3 = 0.7 I$. Binary weights are
adopted.

Fig. 6(a) shows a plot of cut values for different cut positions, averaged
over 20 Monte Carlo runs. We note that the cut-value plot resembles the
underlying density. The two density valleys are both at unbalanced
positions. The rightmost cluster is smaller than the left cluster, but has
a deeper valley.

To apply our method we vary the cluster-size threshold $\delta$ in Eq. (9). We now plot the cut value against $\delta$, as shown in Fig. 6(b). As seen in Fig. 6(b), when $\delta > 0.3$, the optimal cut is close to the valley. However, since the proportion of data samples in the smaller clusters is less than 30%, we see that the optimal cut is bounded away from both valleys. As $\delta$ is decreased in the range $0.25 > \delta > 0.15$, the optimal cut is now attained at the left valley ($x_1 \approx 1.8$). An interesting phenomenon is that the curve flattens out in this range. This corresponds to the fact that the cut value is minimized at this position ($x_1 = 1.8$) for any value of $\delta \in [0.15, 0.25]$. This flattening out can happen only at valleys, since valleys represent a "local" minimum for the model selection step of Eq. (9) under the constraint imposed by $\delta$. Consequently, small clusters can be detected based on the flat spots. Next, when we further vary $\delta$ in the region $0.1 > \delta > 0.05$, the best cut is attained near the right and deeper valley ($x_1 \approx 8.2$). Again the curve flattens out, revealing another small cluster.
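The threshold sweep described above can be sketched as a constrained minimization: among candidate partitions whose smallest cluster holds at least a fraction $\delta$ of the data, pick the one with the smallest cut value. The candidate list below is entirely made up for illustration (positions, cut values and cluster fractions are not results from the paper), and `Candidate` and `select_cut` are hypothetical names.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    position: float      # cut position along x_1
    cut_value: float     # cut value of the partition
    min_fraction: float  # fraction of points in the smallest cluster

def select_cut(cands: List[Candidate], delta: float) -> Candidate:
    """Return the smallest-cut candidate whose smallest cluster is sizable."""
    feasible = [c for c in cands if c.min_fraction >= delta]
    if not feasible:
        raise ValueError("no partition satisfies the cluster-size constraint")
    return min(feasible, key=lambda c: c.cut_value)

# Hypothetical candidates: two valleys (8.2 and 1.8) and two balanced-ish cuts.
candidates = [
    Candidate(position=8.2, cut_value=1.0, min_fraction=0.09),
    Candidate(position=1.8, cut_value=3.0, min_fraction=0.18),
    Candidate(position=3.0, cut_value=9.0, min_fraction=0.35),
    Candidate(position=4.5, cut_value=12.0, min_fraction=0.50),
]

# Sweeping delta: the selected position stays constant over a range of
# thresholds (a flat spot) and jumps when a valley becomes infeasible.
sweep = {d: select_cut(candidates, d).position for d in (0.05, 0.08, 0.12, 0.15, 0.30)}
```

In this toy sweep the selected cut stays at 8.2 for the two smallest thresholds, stays at 1.8 for the next two, and moves away from both valleys at 0.30, mirroring the flat spots discussed in the text.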

5.4 Comments 

Tuning Parameters: $\lambda$ is a parameter that is optimized through the model selection step and does not count as a tuning parameter (the same holds for $k$ and $\sigma$ under our framework). The choice of $\delta$ is based on our prior to find sizable clusters, say 5% to 10% of the data. As for $k_0$ and $l$, our method appears to be relatively insensitive to their values. Unlike the graph parameters $\lambda$, $k$, $\sigma$, which have a direct impact on graph-based algorithms, $k_0$ is only used to compare different partitions relative to each other, and $l$ is only used to order data points relative to each other. It is not surprising that the relative ranking of high/low density cuts (or of points near high/low density areas) does not substantially change when compared on a nearest neighbor graph with different $k_0$ ($l$), as is usually the case in our experiments. Similar phenomena have been observed
Table 2: Error rates (%) of GRF and GTAM on various graphs for unbalanced real data sets. Our method performs significantly better than other methods.

Data sets (column groups, left to right): USPS, SatImg, OptDigit, LetterRec

Method  Graph             8vso    1,8,3,9  4vs3    1,4,7   ovs8    8vs9    6,1,8   6vs7    6,7,8
GRF     RBF k-NN          5.70    13.29    14.64   16.68   5.68    7.57    7.53    7.67    28.33
GRF     RBF b-matching    6.02    13.06    13.89   16.22   5.95    7.85    7.92    7.82    29.21
GRF     full-RBF          15.41   12.37    14.22   17.58   5.62    9.28    7.74    11.52   28.91
GRF     full-aRBF         12.89   11.74    13.58   17.86   5.78    8.66    7.88    10.10   28.36
GRF     RBF RMD           1.08    10.24    9.74    15.04   2.07    2.30    5.82    5.23    27.24
GTAM    RBF k-NN          4.11    10.88    26.63   20.68   11.76   5.74    12.68   19.45   27.66
GTAM    RBF b-matching    3.96    10.83    27.03   20.83   12.48   5.65    12.28   18.85   28.01
GTAM    full-RBF          16.98   11.28    18.82   21.16   13.59   7.73    13.09   18.66   30.28
GTAM    full-aRBF         13.66   10.05    17.63   22.69   12.15   7.44    13.09   17.85   31.71
GTAM    RBF RMD           1.22    9.13     18.68   19.24   5.81    3.12    10.73   15.67   25.19

(a) Cut value vs. cut position $x_1$ (b) Cut value vs. cluster-size threshold $\delta$ (c) Different clustering results

Figure 6: 2-partition SC results for a mixture of 1 large and 2 small proximal Gaussian components. Both valleys are at unbalanced positions. The rightmost cluster is smaller than the left one and has a deeper valley. Results in (b) are from one run. As shown in (b) and (c), the left cluster is detected for a larger $\delta$, in which case the smaller right cluster is viewed as outliers. When $\delta$ is reduced further, the smaller right cluster is detected (Eq. (9)).



in the context of high-dimensional anomaly detection (Zhao & Saligrama, 2009; Saligrama & Zhao, 2012). We fix $k_0 = l$, roughly on the same scale as $\sqrt{n}$, in experiments.

Time Complexity: The time complexity of the U-statistic rank computation is $O(B d n^2 \log n)$, where $B$ is a small constant (5 in our experiments). RMD graph construction is $O(d n^2 \log n)$, the same as constructing a k-NN graph. Computing the cut value and checking the sizable-cluster constraint for a partition takes $O(n^2)$. So if $D$ graphs are parameterized in total and the graph-based learning algorithm needs time $T$, the overall complexity is $O((B + D) d n^2 \log n + DT)$.

6 Conclusions 

We have shown that RCut (NCut) based spectral methods on traditional graphs can lead to balanced cuts rather than density valley cuts for unbalanced proximal data. We propose a systematic procedure to parameterize graphs based on a rank-modulated degree (RMD) scheme, which adaptively sparsifies/densifies the neighborhoods of nodes. This scheme effectively adapts RCut (NCut) based methods to unbalanced data. We then present a model selection step which selects the best sizable clusters separated by the smallest cut value. By constraining the smallest cluster sizes we can detect multiple small clusters and generate different meaningful cuts. Our experiments on synthetic and real data demonstrate significant performance improvements over existing methods for unbalanced data. The ability to detect small-size clusters indicates that our idea may be useful in other applications such as community detection in large networks, where graph-based approaches are popular but small-size community detection is difficult.
Appendix: Proofs of Theorems

For ease of development, let $n = m_1(m_2 + 1)$, and divide the $n$ data points into $D = D_0 \cup D_1 \cup \dots \cup D_{m_1}$, where $D_0 = \{x_1, \dots, x_{m_1}\}$, and each $D_j$, $j = 1, \dots, m_1$, contains $m_2$ points. $D_j$ is used to generate the statistic $G$ for $u$ and $x_j \in D_0$, for $j = 1, \dots, m_1$. $D_0$ is used to compute the rank of $u$:

$$R(u) = \frac{1}{m_1} \sum_{j=1}^{m_1} \mathbb{I}\{G(x_j; D_j) > G(u; D_j)\} \tag{14}$$

We provide the proof for the statistic G(u) of the following form: 

i m* 

<?(«;£,•) = - UJ (15) 

where $D_{(i)}(u)$ denotes the distance from $u$ to its $i$-th nearest neighbor among the $m_2$ points in $D_j$. In practice we can omit the weight, as in Eq. (10) in the paper.
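The sample-splitting rank of Eq. (14) can be illustrated with a short sketch. It assumes, as a stand-in for $G$, the unweighted average distance from a point to its $l$ nearest neighbors in $D_j$; this choice, the function names, the standard Gaussian test data and the values of $m_1$, $m_2$ and $l$ are assumptions made for illustration only.

```python
import numpy as np

def knn_statistic(point, ref, l):
    """Average distance from `point` to its l nearest neighbors in `ref`
    (an unweighted stand-in for the statistic G)."""
    dist = np.sort(np.linalg.norm(ref - point, axis=1))
    return dist[:l].mean()

def empirical_rank(u, data, m1, m2, l):
    """Rank of u as in Eq. (14): fraction of blocks j in which the held-out
    point x_j looks sparser than u when both are measured against D_j."""
    d0 = data[:m1]                                      # D_0 = {x_1, ..., x_m1}
    blocks = data[m1:m1 + m1 * m2].reshape(m1, m2, -1)  # D_1, ..., D_m1
    hits = [
        knn_statistic(d0[j], blocks[j], l) > knn_statistic(u, blocks[j], l)
        for j in range(m1)
    ]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
m1, m2, l = 50, 40, 5
data = rng.standard_normal((m1 * (m2 + 1), 2))          # n = m1 * (m2 + 1) points

rank_center = empirical_rank(np.array([0.0, 0.0]), data, m1, m2, l)  # dense region
rank_far = empirical_rank(np.array([8.0, 8.0]), data, m1, m2, l)     # far outlier
```

A point near the mode receives a rank close to 1 and a far outlier a rank close to 0, which is the ordering that the degree modulation relies on.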



Regularity conditions: $f(\cdot)$ is continuous and lower-bounded: $f(x) > f_{\min} > 0$. It is smooth, i.e. $\|\nabla f(x)\| < \lambda$, where $\nabla f(x)$ is the gradient of $f(\cdot)$ at $x$. Flat regions are disallowed, i.e. $\forall x \in X$, $\forall \sigma > 0$, $\mathbb{P}\{y : |f(y) - f(x)| < \sigma\} < M\sigma$, where $M$ is a constant.

Proof of Theorem 1: 

Proof. The proof involves two steps:

1. The expectation of the empirical rank $\mathbb{E}[R(u)]$ is shown to converge to $p(u)$ as $n \to \infty$.

2. The empirical rank $R(u)$ is shown to concentrate at its expectation as $n \to \infty$.

The first step is shown through Lemma 5. For the second step, notice that the rank $R(u) = \frac{1}{m_1}\sum_{j=1}^{m_1} Y_j$, where $Y_j = \mathbb{I}\{G(x_j; D_j) > G(u; D_j)\}$ is independent across different $j$'s, and $Y_j \in [0, 1]$. By Hoeffding's inequality, we have:

$$\mathbb{P}\left(|R(u) - \mathbb{E}[R(u)]| > \epsilon\right) \le 2\exp\left(-2 m_1 \epsilon^2\right) \tag{16}$$

Combining these two steps finishes the proof. □
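To give a feel for the concentration bound in Eq. (16), the sketch below evaluates its right-hand side and inverts it to find how many blocks $m_1$ are enough for a given accuracy. The function names and the example accuracy and failure probability are illustrative assumptions.

```python
import math

def hoeffding_bound(m1, eps):
    """Right-hand side of Eq. (16), capped at 1 since it bounds a probability."""
    return min(1.0, 2.0 * math.exp(-2.0 * m1 * eps ** 2))

def blocks_needed(eps, failure_prob):
    """Smallest m1 for which 2 * exp(-2 * m1 * eps^2) <= failure_prob."""
    return math.ceil(math.log(2.0 / failure_prob) / (2.0 * eps ** 2))

m1_required = blocks_needed(eps=0.1, failure_prob=0.05)
bound_at_required = hoeffding_bound(m1_required, eps=0.1)
```

For a deviation of 0.1 and a failure probability of 0.05 this gives $m_1 = 185$ blocks, so the required number of blocks grows as $1/\epsilon^2$ and only logarithmically in the inverse failure probability.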

Proof of Theorem 2: 

Proof. We only present a brief outline of the proof. We want to establish the convergence of the cut term and of the balancing terms respectively, that is:

$$\frac{1}{n k_n}\sqrt[d]{\frac{n}{k_n}}\,\mathrm{cut}_n(S) \to C_d \int_S f^{1-\frac{1}{d}}(s)\,\rho(s)^{1+\frac{1}{d}}\,ds, \tag{17}$$

$$n\,\frac{1}{|V^{\pm}|} \to \frac{1}{\mu(C^{\pm})}, \tag{18}$$

where $V^{+}(V^{-}) = \{x \in V : x \in C^{+}(C^{-})\}$ are the discrete versions of $C^{+}(C^{-})$.

The balancing terms, Eqs. (18) and (19), are obtained similarly using the Chernoff bound on the sum of binomial random variables, since the number of points in $V^{\pm}$ is binomially distributed as $\mathrm{Binom}(n, \mu(C^{\pm}))$. Details can be found in Maier et al. (2008a).

Eq. (17) is established in two steps. First we can show that the LHS cut term converges to its expectation $\mathbb{E}\left[\frac{1}{n k_n}\sqrt[d]{\frac{n}{k_n}}\,\mathrm{cut}_n(S)\right]$ by McDiarmid's inequality. This can also be found in Maier et al. (2008a). Second we show that this expectation term actually converges to the RHS of Eq. (17). This is shown in Lemma 4.

□

□ 
Lemma 4. Given the assumptions of Theorem 2,

$$\mathbb{E}\left(\frac{1}{n k_n}\sqrt[d]{\frac{n}{k_n}}\,\mathrm{cut}_n(S)\right) \to C_d \int_S f^{1-\frac{1}{d}}(s)\,\rho(s)^{1+\frac{1}{d}}\,ds, \tag{20}$$

where $C_d = \dfrac{2\eta_{d-1}}{(d+1)\,\eta_d^{1+1/d}}$.



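Proof. The proof is a simple extension of Maier et al. (2008b). We provide an outline here. The first trick is to define a cut function for a fixed point $x_i \in V^{+}$, whose expectation is easier to compute:

$$\mathrm{cut}_{x_i} = \sum_{v \in V^{-},\,(x_i, v) \in E} w(x_i, v). \tag{21}$$

Similarly, we can define $\mathrm{cut}_{x_i}$ for $x_i \in V^{-}$. The expectations of $\mathrm{cut}_{x_i}$ and $\mathrm{cut}_n(S)$ can be related:

$$\mathbb{E}(\mathrm{cut}_n(S)) = n\,\mathbb{E}_x(\mathbb{E}(\mathrm{cut}_x)). \tag{22}$$

Then the value of $\mathbb{E}(\mathrm{cut}_{x_i})$ can be computed as

$$(n-1)\int_0^{\infty}\left[\int_{B(x_i, r)\cap C^{-}} f(y)\,dy\right] dF_{R^k_{x_i}}(r), \tag{23}$$

where $r$ is the distance of $x_i$ to its $k_n\rho(x_i)$-th nearest neighbor. The value of $r$ is a random variable and can be characterized by the CDF $F_{R^k_{x_i}}(r)$. Combining this with Eq. (22), we can write down the whole expected cut value:

$$\mathbb{E}(\mathrm{cut}_n(S)) = n\,\mathbb{E}_x(\mathbb{E}(\mathrm{cut}_x)) = n\int f(x)\,\mathbb{E}(\mathrm{cut}_x)\,dx \tag{24}$$

$$= n(n-1)\int f(x)\left[\int_0^{\infty} g(x, r)\,dF_{R^k_x}(r)\right] dx. \tag{25}$$

To simplify the expression, we use $g(x, r)$ to denote

$$g(x, r) = \begin{cases} \int_{B(x, r)\cap C^{-}} f(y)\,dy, & x \in C^{+}, \\ \int_{B(x, r)\cap C^{+}} f(y)\,dy, & x \in C^{-}. \end{cases} \tag{26}$$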



Under general assumptions, when $n$ tends to infinity, the random variable $r$ will highly concentrate around its mean $\mathbb{E}(r^k_x)$. Furthermore, as $k_n/n \to 0$, $\mathbb{E}(r^k_x)$ tends to zero at the rate

$$\mathbb{E}(r^k_x) \approx \left(\frac{k\rho(x)}{(n-1)\,f(x)\,\eta_d}\right)^{1/d}. \tag{27}$$

So the inner integral in the cut value can be approximated by $g(x, \mathbb{E}(r^k_x))$, which implies

$$\mathbb{E}(\mathrm{cut}_n(S)) \approx n(n-1)\int f(x)\,g(x, \mathbb{E}(r^k_x))\,dx. \tag{28}$$
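The approximation in Eq. (27) can be checked numerically in a simple setting. The sketch below assumes a uniform density $f = 1$ on the unit square ($d = 2$), no degree modulation ($\rho = 1$), and takes $\eta_d$ to be the volume of the unit ball in $\mathbb{R}^d$; these choices, together with the function names and the values of $k$, $n$ and the number of trials, are assumptions made for illustration.

```python
import math
import numpy as np

def predicted_radius(k, n, density, d=2, rho=1.0):
    """Right-hand side of Eq. (27), with eta_d the unit-ball volume in R^d."""
    eta_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return (k * rho / ((n - 1) * density * eta_d)) ** (1.0 / d)

def mean_knn_radius(k, n, trials=200, seed=0):
    """Monte Carlo estimate of the mean k-NN distance at the centre of the
    unit square when the other n - 1 points are uniform on the square."""
    rng = np.random.default_rng(seed)
    x = np.array([0.5, 0.5])
    radii = []
    for _ in range(trials):
        pts = rng.random((n - 1, 2))
        dist = np.linalg.norm(pts - x, axis=1)
        radii.append(np.partition(dist, k - 1)[k - 1])  # k-th smallest distance
    return float(np.mean(radii))

k, n = 20, 2000
empirical = mean_knn_radius(k, n)
predicted = predicted_radius(k, n, density=1.0)
relative_error = abs(empirical - predicted) / predicted
```

In this setting the simulated mean radius and the prediction agree to within a few percent, which is the sense in which the inner integral can be evaluated at $\mathbb{E}(r^k_x)$.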



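The next trick is to decompose the integral over $\mathbb{R}^d$ into two orthogonal directions, i.e., the direction along the hyperplane $S$ and its normal direction (we use $\vec{n}$ to denote the unit normal vector):

$$\int_{\mathbb{R}^d} f(x)\,g(x, \mathbb{E}(r^k_x))\,dx = \int_S\int_{-\infty}^{+\infty} f(s + t\vec{n})\,g(s + t\vec{n}, \mathbb{E}(r^k_{s+t\vec{n}}))\,dt\,ds. \tag{29}$$

When $t > \mathbb{E}(r^k_{s+t\vec{n}})$, the integral region of $g$ will be empty: $B(x, \mathbb{E}(r^k_x)) \cap C^{-} = \emptyset$. On the other hand, when $x = s + t\vec{n}$ is close to $s \in S$, we have the approximation $f(x) \approx f(s)$:

$$\int_{-\infty}^{+\infty} f(s + t\vec{n})\,g(s + t\vec{n}, \mathbb{E}(r^k_{s+t\vec{n}}))\,dt \tag{30}$$

$$\approx 2\int_0^{\mathbb{E}(r^k_s)} f(s)\left[f(s)\,\mathrm{vol}\left(B(s + t\vec{n}, \mathbb{E}(r^k_s)) \cap C^{-}\right)\right] dt \tag{31}$$

$$= 2 f^2(s)\int_0^{\mathbb{E}(r^k_s)} \mathrm{vol}\left(B(s + t\vec{n}, \mathbb{E}(r^k_s)) \cap C^{-}\right) dt. \tag{32}$$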
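The term $\mathrm{vol}\left(B(s + t\vec{n}, \mathbb{E}(r^k_s)) \cap C^{-}\right)$ is the volume of a $d$-dimensional spherical cap of radius $\mathbb{E}(r^k_s)$, which is at distance $t$ from the center. Through direct computation we obtain:

$$\int_0^{\mathbb{E}(r^k_s)} \mathrm{vol}\left(B(s + t\vec{n}, \mathbb{E}(r^k_s)) \cap C^{-}\right) dt = \mathbb{E}(r^k_s)^{d+1}\,\frac{\eta_{d-1}}{d+1}. \tag{33}$$

Combining the above step and plugging in the approximation of $\mathbb{E}(r^k_s)$ in Eq. (27), we finish the proof. □

Lemma 5. By choosing $l$ properly, as $m_2 \to \infty$, it follows that

$$|\mathbb{E}[R(u)] - p(u)| \to 0.$$

Proof. Take expectation with respect to $D$:

$$\mathbb{E}_D[R(u)] = \mathbb{E}_{D\setminus D_0}\left[\mathbb{E}_{D_0}\left[\frac{1}{m_1}\sum_{j=1}^{m_1}\mathbb{I}\{G(u; D_j) < G(x_j; D_j)\}\right]\right] \tag{34}$$

$$= \frac{1}{m_1}\sum_{j=1}^{m_1}\mathbb{E}_{x_j}\left[\mathbb{E}_{D_j}\left[\mathbb{I}\{G(u; D_j) < G(x_j; D_j)\}\right]\right] \tag{35}$$

$$= \mathbb{E}_x\left[\mathbb{P}_{D_1}\left(G(u; D_1) < G(x; D_1)\right)\right]. \tag{36}$$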


The last equality holds due to the i.i.d. symmetry of $\{x_1, \dots, x_{m_1}\}$ and $D_1, \dots, D_{m_1}$. We fix both $u$ and $x$ and temporarily discard $\mathbb{E}_x$. Let $F_x(y_1, \dots, y_{m_2}) = G(x) - G(u)$, where $y_1, \dots, y_{m_2}$ are the points in $D_1$. It follows:

$$\mathbb{P}_{D_1}(G(u) < G(x)) = \mathbb{P}_{D_1}(F_x(y_1, \dots, y_{m_2}) > 0) = \mathbb{P}_{D_1}(F_x - \mathbb{E}F_x > -\mathbb{E}F_x). \tag{37}$$



To check McDiarmid's requirements, we replace $y_j$ with $y'_j$. It is easily verified that $\forall j = 1, \dots, m_2$,

$$|F_x(y_1, \dots, y_{m_2}) - F_x(y_1, \dots, y'_j, \dots, y_{m_2})| \le 2^{1/d}\,\frac{2C}{l} \le \frac{4C}{l}, \tag{38}$$

where $C$ is the diameter of the support. Notice that despite the fact that $y_1, \dots, y_{m_2}$ are random vectors we can still apply McDiarmid's inequality, because according to the form of $G$, $F_x(y_1, \dots, y_{m_2})$ is a function of $m_2$ i.i.d. random variables $r_1, \dots, r_{m_2}$, where $r_j$ is the distance from $x$ to $y_j$. Therefore if $\mathbb{E}F_x < 0$, or $\mathbb{E}G(x) < \mathbb{E}G(u)$, we have by McDiarmid's inequality,



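$$\mathbb{P}_{D_1}(G(u) < G(x)) = \mathbb{P}_{D_1}(F_x > 0) = \mathbb{P}_{D_1}(F_x - \mathbb{E}F_x > -\mathbb{E}F_x) \le \exp\left(-\frac{(\mathbb{E}F_x)^2\,l^2}{8C^2 m_2}\right). \tag{39}$$

Rewrite the above inequality as:

$$\mathbb{I}_{\{\mathbb{E}F_x > 0\}} - \exp\left(-\frac{(\mathbb{E}F_x)^2\,l^2}{8C^2 m_2}\right) \le \mathbb{P}_{D_1}(F_x > 0) \le \mathbb{I}_{\{\mathbb{E}F_x > 0\}} + \exp\left(-\frac{(\mathbb{E}F_x)^2\,l^2}{8C^2 m_2}\right). \tag{40}$$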



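It can be shown that the same inequality holds for $\mathbb{E}F_x > 0$, or $\mathbb{E}G(x) > \mathbb{E}G(u)$. Now we take expectation with respect to $x$:

$$\mathbb{P}_x(\mathbb{E}F_x > 0) - \mathbb{E}_x\left[\exp\left(-\frac{(\mathbb{E}F_x)^2\,l^2}{8C^2 m_2}\right)\right] \le \mathbb{E}\left[\mathbb{P}_{D_1}(F_x > 0)\right] \le \mathbb{P}_x(\mathbb{E}F_x > 0) + \mathbb{E}_x\left[\exp\left(-\frac{(\mathbb{E}F_x)^2\,l^2}{8C^2 m_2}\right)\right]. \tag{41}$$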



Divide the support of $x$ into two parts, $X_1$ and $X_2$, where $X_1$ contains those $x$ whose density $f(x)$ is relatively far away from $f(u)$, and $X_2$ contains those $x$ whose density is close to $f(u)$. We show that for $x \in X_1$, the above exponential term converges to 0 and $\mathbb{P}(\mathbb{E}F_x > 0) = \mathbb{P}_x(f(u) > f(x))$, while the rest $x \in X_2$ has very small measure. Let $A(x) = \left(\frac{l}{m_2 c_d f(x)}\right)^{1/d}$. By Lemma 6 we have:
jJ— J . Applying uniform bound we have:
A(x) - A(u) - 2 (j^j (J-^j " < E [G{x) - G{u)] < A(x) - A(u) + 2 (j^J (J^j ' (43) 

Now let Xi = {a; : |/(a;) - f{u)\ > ilidf^ n {^Y}- For x € Xi, it can be verified that \A(x) - A(u)\ > 
3 ( -rfz ) (^) " , or |E [G(x) - G(u)] \ > ( jfa j " , and !{/(„)> /( x )} = I{eg(z)>eg(«)}- For the exponential 



term in Equ. (|40[ ) we have: 



r,s~t1 ] — r I 2 i _,_ 

2C 2 m 2 J V 8C <a c 3 *+t 



•"d '"2 



/ m i„}, by the regularity assumption, we have "P(X 2 ) < 
^- j f m d in - Combining the two cases into Equ. (jlT|) we have for upper bound: 

E D [R(u)} = E x [P Dl (G(u)<G(x))} (45) 
V Dl (G(u) < G(x)) f(x)dx + [ V Dl (G{u) < G(x)) f{x)dx (46) 



< ( V x (/(«) > /Or)) + exp ( 7l 'f + \ \\v{x£ X x ) + V{x € X 2 ) (47) 

V V %C 2 c*m\ +d j) 

( ^2/2+| \ j+i / / \ 3 

< (/(«) > /(a)) + exp 71 ; + 3M 7l dfJ- n — (48) 

Let $l = m_2^{\alpha}$ such that $0 < \alpha < 1$, and the latter two terms will converge to 0 as $m_2 \to \infty$. Similar lines hold for the lower bound. The proof is finished. □

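Lemma 6. Let $A(x) = \left(\frac{l}{m c_d f(x)}\right)^{1/d}$, $\lambda_1 = \frac{\lambda}{f_{\min}}\left(\frac{1.5}{c_d f_{\min}}\right)^{1/d}$. By choosing $l$ appropriately, the expectation of the $l$-NN distance $\mathbb{E}D_{(l)}(x)$ among $m$ points satisfies:

$$|\mathbb{E}D_{(l)}(x) - A(x)| = O\left(A(x)\,\lambda_1\left(\frac{l}{m}\right)^{1/d}\right). \tag{49}$$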


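Proof. Denote $r(x, \alpha) = \min\{r : \mathbb{P}(B(x, r)) > \alpha\}$. Let $\delta_m \to 0$ as $m \to \infty$, and $0 < \delta_m < 1/2$. Let $U \sim \mathrm{Bin}(m, (1 + \delta_m)\frac{l}{m})$ be a binomial random variable, with $\mathbb{E}U = (1 + \delta_m)l$. We have:

$$\mathbb{P}\left(D_{(l)}(x) > r\left(x, (1 + \delta_m)\tfrac{l}{m}\right)\right) = \mathbb{P}(U < l) \tag{50}$$

$$= \mathbb{P}\left(U < \left(1 - \frac{\delta_m}{1 + \delta_m}\right)(1 + \delta_m)\,l\right) \tag{51}$$

$$\le \exp\left(-\frac{\delta_m^2\,l}{2(1 + \delta_m)}\right). \tag{52}$$

The last inequality holds from Chernoff's bound. Abbreviate $r_1 = r(x, (1 + \delta_m)\frac{l}{m})$, and $\mathbb{E}D_{(l)}(x)$ can be bounded as:

$$\mathbb{E}D_{(l)}(x) \le r_1\left[1 - \mathbb{P}(D_{(l)}(x) > r_1)\right] + C\,\mathbb{P}(D_{(l)}(x) > r_1) \tag{53}$$

$$\le r_1 + C\exp\left(-\frac{\delta_m^2\,l}{2(1 + \delta_m)}\right), \tag{54}$$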
where $C$ is the diameter of the support. Similarly we can show the lower bound:

$$\mathbb{E}D_{(l)}(x) \ge r\left(x, (1 - \delta_m)\tfrac{l}{m}\right) - C\exp\left(-\frac{\delta_m^2\,l}{2(1 - \delta_m)}\right). \tag{55}$$

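Consider the upper bound. We relate $r_1$ with $A(x)$. Notice that $\mathbb{P}(B(x, r_1)) = (1 + \delta_m)\frac{l}{m} \ge c_d r_1^d f_{\min}$, so a fixed but loose upper bound is $r_1 \le \left(\frac{(1 + \delta_m)\,l}{c_d f_{\min}\,m}\right)^{1/d} = r_{\max}$. Assume $l/m$ is sufficiently small so that $r_1$ is sufficiently small. By the smoothness condition, the density within $B(x, r_1)$ is lower-bounded by $f(x) - \lambda r_1$, so we have:

$$\mathbb{P}(B(x, r_1)) = (1 + \delta_m)\frac{l}{m} \tag{56}$$

$$\ge c_d r_1^d\left(f(x) - \lambda r_1\right) \tag{57}$$

$$= c_d r_1^d f(x)\left(1 - \frac{\lambda}{f(x)}\,r_1\right) \tag{58}$$

$$\ge c_d r_1^d f(x)\left(1 - \frac{\lambda}{f_{\min}}\,r_{\max}\right). \tag{59}$$

That is:

$$r_1 \le A(x)\left(\frac{1 + \delta_m}{1 - \frac{\lambda}{f_{\min}}\,r_{\max}}\right)^{1/d}. \tag{60}$$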

l/d 



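Insert the expression of $r_{\max}$ and set $\lambda_1 = \frac{\lambda}{f_{\min}}\left(\frac{1.5}{c_d f_{\min}}\right)^{1/d}$; we have:

$$\mathbb{E}D_{(l)}(x) - A(x) \le A(x)\left[\left(\frac{1 + \delta_m}{1 - \lambda_1\left(\frac{l}{m}\right)^{1/d}}\right)^{1/d} - 1\right] + C\exp\left(-\frac{\delta_m^2\,l}{2(1 + \delta_m)}\right) \tag{61}$$

$$= O\left(A(x)\,\lambda_1\left(\frac{l}{m}\right)^{1/d}\right). \tag{62}$$

The last equality holds if we choose $l = m^{\frac{3d+8}{4d+8}}$ and $\delta_m = m^{-\frac{1}{4}}$. Similar lines follow for the lower bound. Combine these two parts and the proof is finished.

□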



